# Step 1 - File Reader and Text Extraction
**Status:** Complete
---
## Goal
Build a module that scans any directory and extracts text content from all supported file types.
---
## Input
```python
extract_all(folder_path: str, max_size: int = 1_000_000) -> list[Document]  # max_size in bytes (1 MB)
```
## Output
```python
from dataclasses import dataclass

@dataclass
class Document:
    path: str        # absolute file path
    content: str     # extracted text
    file_type: str   # extension (without dot)
    modified: float  # mtime for incremental sync
```
---
## Supported File Types
| Extension | Extraction Method |
|-----------|-------------------|
| `.md`, `.txt`, `.rst` | `read_text()` |
| `.html`, `.htm` | BeautifulSoup `get_text()` |
| `.json` | `json.dumps()` with formatting |
| `.csv` | Headers + first 100 rows as text |
| `.py` | Full file (docstrings + code) |
| `.js`, `.ts`, `.jsx`, `.tsx` | Full file |
| `.pdf` | PyMuPDF `get_text()` |
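Extraction is a dispatch on file extension. A minimal sketch of the per-type extractors and the mapping, following the table above; the function bodies are assumptions about the implementation, not copies of `file_reader.py` (PyMuPDF's import name is `fitz`):
```python
import csv
import json
from pathlib import Path
from typing import Callable, Optional

def extract_text(path: Path) -> Optional[str]:
    """Plain read for .md/.txt/.rst and source files."""
    try:
        return path.read_text(encoding="utf-8")
    except (UnicodeDecodeError, OSError):
        return None  # unreadable file; caller skips it

def extract_html(path: Path) -> str:
    """Visible text only; the no-BeautifulSoup fallback is sketched in the review below."""
    from bs4 import BeautifulSoup
    return BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser").get_text()

def extract_json(path: Path) -> str:
    """Parse, then re-serialize with indentation so values read as text."""
    return json.dumps(json.loads(path.read_text(encoding="utf-8")), indent=2)

def extract_csv(path: Path, max_rows: int = 100) -> str:
    """Header row plus the first max_rows data rows, flattened to text."""
    with path.open(newline="", encoding="utf-8") as f:
        return "\n".join(", ".join(row) for _, row in zip(range(max_rows + 1), csv.reader(f)))

def extract_pdf(path: Path) -> Optional[str]:
    """Concatenate per-page text via PyMuPDF."""
    import fitz  # PyMuPDF
    try:
        with fitz.open(str(path)) as doc:
            return "".join(page.get_text() for page in doc)
    except Exception:
        return None  # corrupt or encrypted PDF

EXTRACTORS: dict[str, Callable] = {
    ".md": extract_text, ".txt": extract_text, ".rst": extract_text,
    ".html": extract_html, ".htm": extract_html,
    ".json": extract_json, ".csv": extract_csv,
    ".py": extract_text, ".js": extract_text, ".ts": extract_text,
    ".jsx": extract_text, ".tsx": extract_text, ".pdf": extract_pdf,
}
```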
---
## Skip List
- Binary files (images, video, executables)
- Files > 1MB (configurable via `max_size` param)
- Hidden files/folders (`.git`, `.env`, `.venv`, `.obsidian`, etc.)
- Common non-content dirs (`node_modules`, `__pycache__`, `dist`, `build`); see the sketch below
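A minimal sketch of this skip logic, assuming the directory set above and an extension allow-list drawn from the Supported File Types table (the real filter is evidently broader, since the vault run below also extracted `.bat` and `.ps1`):
```python
from pathlib import Path

SKIP_DIRS = {"node_modules", "__pycache__", "dist", "build"}
SUPPORTED = {".md", ".txt", ".rst", ".html", ".htm", ".json",
             ".csv", ".py", ".js", ".ts", ".jsx", ".tsx", ".pdf"}

def is_hidden(path: Path) -> bool:
    """True if any path component starts with a dot (.git, .venv, .obsidian, ...)."""
    return any(part.startswith(".") for part in path.parts)

def should_skip(path: Path, max_size: int) -> bool:
    """Skip hidden paths, non-content dirs, unsupported extensions, oversized files."""
    if is_hidden(path) or any(part in SKIP_DIRS for part in path.parts):
        return True
    if path.suffix.lower() not in SUPPORTED:
        return True  # binaries and anything else unsupported
    return path.stat().st_size > max_size
```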
---
## Dependencies
```
beautifulsoup4
PyMuPDF
```
---
## Implementation
### Files Created
| File | Purpose |
| --------------------- | --------------------------------------------------- |
| `file_reader.py` | Main module with `extract_all()` and all extractors |
| `test_file_reader.py` | Verification tests (6 tests) |
| `requirements.txt` | Dependencies |
### Key Functions
```python
# Main entry point
extract_all(folder_path, max_size) -> list[Document]
# Individual extractors
extract_text(file_path) -> Optional[str]
extract_html(file_path) -> str
extract_json(file_path) -> str
extract_csv(file_path) -> str
extract_pdf(file_path) -> Optional[str]
# Utilities
is_hidden(path) -> bool
should_skip(path, max_size) -> bool
```
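A hedged sketch of how these compose, reusing the `Document` dataclass and the `EXTRACTORS` / `should_skip` sketches above; the traversal order and the silent-skip-on-failure behavior are assumptions:
```python
from pathlib import Path

def extract_all(folder_path: str, max_size: int = 1_000_000) -> list[Document]:
    """Walk the tree, filter with should_skip(), dispatch by extension."""
    documents = []
    for path in sorted(Path(folder_path).rglob("*")):
        if not path.is_file() or should_skip(path, max_size):
            continue
        extractor = EXTRACTORS.get(path.suffix.lower())
        content = extractor(path) if extractor else None
        if content:  # None/empty means extraction failed; skip silently
            documents.append(Document(
                path=str(path.resolve()),
                content=content,
                file_type=path.suffix.lstrip("."),
                modified=path.stat().st_mtime,
            ))
    return documents
```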
---
## Test Results
### Verification Tests
```
==================================================
FILE READER VERIFICATION TESTS
==================================================
[PASS] Document dataclass
[PASS] is_hidden()
[PASS] should_skip()
[PASS] extract_text() for all types
[PASS] extract_all() integration
[PASS] Error handling
--------------------------------------------------
Results: 6 passed, 0 failed
--------------------------------------------------
```
### Real-World Test (Notes Vault)
**Tested:** 2025-12-10
```
=== EXTRACTION RESULTS ===
Total documents extracted: 369
By file type:
.md: 329 (89%)
.py: 12
.txt: 11
.json: 7
.html: 6
.csv: 2
.bat: 1
.ps1: 1
Content stats:
Total characters: 948,132
Average per doc: 2,569
Largest documents:
44,658 chars - orchestrator.py
42,924 chars - mcp-html-fetch-server-prd.md
21,189 chars - Strengths Cours Draft.md
20,423 chars - hiring_pipeline_analysis.py
19,708 chars - MT Hiring Pipeline Analysis - YTD 2025.md
Smallest documents:
5 chars - -Encoding
6 chars - Idea.md
7 chars - path
=== CORRECTLY SKIPPED ===
.obsidian folders: 2 (Obsidian config)
Binary files: 4 (png, jpg, pdf)
```
**Verdict:** ✅ Working as expected
---
## What This Means for Step 2
Step 1 provides **raw documents with full text content**. Step 2 needs to:
1. **Chunk the content** - Split large documents into semantic chunks (~500-1000 tokens) for better embedding quality (sketched after the interface below)
2. **Preserve metadata** - Each chunk needs to retain `path`, `file_type`, and chunk position
3. **Handle overlap** - Chunks should overlap slightly to avoid breaking context at boundaries
### Interface for Step 2
```python
# Step 1 output -> Step 2 input
documents: list[Document] = extract_all(folder_path)
# Step 2 transforms to:
@dataclass
class Chunk:
    content: str      # chunk text
    source_path: str  # original file
    chunk_index: int  # position in document
    metadata: dict    # file_type, modified, etc.
```
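A minimal sketch of the overlapping fixed-window split from the list above, using these two dataclasses; character counts stand in for tokens here, and the sizes are illustrative defaults, not Step 2's final values:
```python
def chunk_document(doc: Document, chunk_size: int = 2000, overlap: int = 200) -> list[Chunk]:
    """Slide a window of chunk_size chars, stepping by chunk_size - overlap."""
    if not doc.content:
        return []
    step = chunk_size - overlap  # requires chunk_size > overlap
    return [
        Chunk(
            content=doc.content[start:start + chunk_size],
            source_path=doc.path,
            chunk_index=i,
            metadata={"file_type": doc.file_type, "modified": doc.modified},
        )
        for i, start in enumerate(range(0, len(doc.content), step))
    ]
```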
The `modified` timestamp on `Document` enables **incremental sync** - only re-process files that have changed since the last index.
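For illustration, a sketch of that check; `indexed_mtimes` (path to mtime recorded at the last index) is a hypothetical store a later step would maintain, not something Step 1 provides:
```python
def files_to_reindex(docs: list[Document], indexed_mtimes: dict[str, float]) -> list[Document]:
    """Keep documents that are new or modified since their last indexing."""
    return [d for d in docs if d.modified > indexed_mtimes.get(d.path, 0.0)]
```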
---
## Next Step
[[Step 2 - Text Chunking and Overlap]]
---
## Code Review (2025-12-10)
**Reviewer:** Claude
**Verdict:** ✅ Solid Foundation - Ready for Step 2
### Strengths
1. **Clean Architecture** - The `Document` dataclass is well-designed with the `modified` timestamp enabling incremental indexing
2. **Intelligent Skip Logic** - Properly filters hidden files, binary extensions, oversized files, and common non-content directories
3. **Graceful Degradation** - HTML extraction falls back to regex tag-stripping if BeautifulSoup is unavailable (sketched after this list)
4. **Comprehensive Tests** - 6 tests covering unit, integration, and error handling - all passing ✅
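A sketch of that degradation path (item 3 above); the fallback regex is illustrative, and the module's actual pattern may differ:
```python
import re
from pathlib import Path

def extract_html(file_path: Path) -> str:
    """Prefer BeautifulSoup; fall back to naive tag stripping if it's not installed."""
    html = file_path.read_text(encoding="utf-8")
    try:
        from bs4 import BeautifulSoup
        return BeautifulSoup(html, "html.parser").get_text(separator="\n")
    except ImportError:
        # Crude fallback: drop script/style blocks, then strip remaining tags.
        html = re.sub(r"(?is)<(script|style).*?</\1>", "", html)
        return re.sub(r"<[^>]+>", " ", html)
```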
### Areas for Future Improvement
| Issue | Impact | Priority |
|-------|--------|----------|
| No `.docx`/`.xlsx` support | Missing Office docs from OneDrive | Medium |
| CSV truncation at 100 rows | May lose data in large spreadsheets | Low |
| UTF-8 only encoding | Older files may fail silently | Low |
| Basic PDF extraction | No page numbers or table extraction | Low |
### Recommendation
**Ship it.** The foundation is solid and the interface for Step 2 is well-defined. Office document support can be added later without architectural changes.